Predicting a Country's Population Density
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import lit
from pyspark.sql.types import (
StringType,
StructField,
StructType,
FloatType,
)
import warnings
from operator import attrgetter
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=RuntimeWarning)
warnings.filterwarnings('ignore', category=UserWarning)
warnings.simplefilter("ignore")
%%local
import matplotlib.pyplot as plt
from tqdm import tqdm
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook'
%%javascript
require.config({
paths: {
plotly: 'https://cdn.plot.ly/plotly-latest.min'
},
});
The world’s population grows every second, as steadily as the clock ticks. As of November 2020, it is estimated at around 7.8 billion. With a new birth every second, overpopulation is always a possibility in many areas of the world. For a country, it is crucial to understand the statistics behind its population. The most relevant question for us now is: can we predict a country’s population density from the demographic category and coordinates of a location?
Population density is the measurement of population per unit area. It simply refers to the number of people living in a given area, for example, a household. This research tackles the population density values of three countries: the Philippines, Japan, and South Korea.
The research paper creates different machine learning models to predict population density. The ML models used are: (a) Linear Regressor, (b) Decision Tree Regressor, and (c) Random Forest Regressor.
Using Apache Spark on m4.large instances with 8GB of RAM in AWS EMR to visualize, preprocess, and analyze 11 gigabytes of raw data, the research paper arrived at significant insights. It was observed that for Japan, South Korea, and the Philippines, it is possible to predict population density using the latitude and longitude parameters together with the demographic parameter. It was also observed that among the ML models used, the Decision Tree Regressor achieved the highest accuracy for all countries.
For a country, it is crucial to determine the size of its population. One of the most important factors in assessing the economic life of a country is its population growth. When many people concentrate in a certain area, demand for resources changes accordingly. If a large number of people live in a certain household, utilities such as water, food, electricity, and internet would be consumed at a much higher rate than with fewer people. In the case of Tokyo, Japan, rent is notoriously high due to the population density of the city. At the same time, the available supply of such utilities decreases, following the law of supply and demand. On the other hand, having population density concentrated in certain areas leaves nature elsewhere untouched, preserving its beauty for us to appreciate, which is the case in many areas of our country, the Philippines. This can also aid a country's economy by attracting travelers and developing the tourism industry. Therefore, countries are very keen on monitoring population growth rates in different areas within their jurisdiction.
One way of checking the effect of population is through population density, the measurement of population per unit area. It simply refers to the number of people living in an area, for example, a household. In checking population density, there are certain factors to consider. Many countries check whether their population density is too high, which might lead to overconsumption of resources and diminishing supply. Countries can also check whether population density in some areas is too low, indicating lagging development in sparsely populated areas. Indeed, population density has a role in economic development, and further studies can be conducted in this aspect [2].
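As a quick illustration of the definition above, population density is just a ratio of people to area. The figures below are rough, illustrative values (roughly 13 million people over the approximately 620 km² of Metro Manila), not numbers taken from the dataset:

```python
# Population density = population / land area.
# Illustrative figures only, not taken from the dataset.
def population_density(population, area_km2):
    """Return people per square kilometre."""
    return population / area_km2

density = population_density(13_000_000, 620)  # approx. Metro Manila
print(round(density))  # → 20968 people per km^2
```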
The High Resolution Population Density dataset in AWS's Open Data Registry is a set of population density data for a selection of countries, produced by the Facebook Connectivity Lab and the Center for International Earth Science Information Network (CIESIN) at Columbia University. The dataset estimates the number of people living within 30-meter grid tiles. In this research, we will create machine learning models to predict the population density of each country.
Population Prediction: This research observes the possible prediction of population and population density in the given countries.
Economic Overview: This research can be used to analyze the economic effect of population density in the given countries.
Budget Allocation: For the Philippines, this research can help prescribe government spending and resource allocation where it can have the most impact or where it is needed most.
To properly address the problem, the researchers will use a portion of the Facebook - CIESIN dataset. Data from select countries, namely Japan, South Korea, and the Philippines, will be used. Some information, such as the country and demographic, is encoded in the file name, so corresponding columns will be added during processing. The researchers will follow the general workflow defined below to arrive at a conclusion and recommendations.
Each step will be discussed in detail in the following sections; to give a general overview of the methodology, each step is briefly described below:
The filepath for the dataset is as follows:
aws s3 ls s3://dataforgood-fb-data/ --no-sign-request
Documentation Link:
https://dataforgood.fb.com/docs/
Note that more files are available under the chosen dataset, but to minimize the scope of the research, only files for select countries will be used. The file sizes were checked by downloading and decompressing the files locally, arriving at a total of about 11GB.
The gathered data will be cleaned and formatted as required, prior to the analysis. The steps involved are:
a. Decompressing gzip files
b. Merging of Different Files (Using Spark)
For this dataset in particular, no cleaning or transformation was required. Multiple files were merged to arrive at the final complete dataset.
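The decompression step (a) can be sketched with Python's built-in gzip module. The round-trip below is a self-contained illustration; the commented filename merely follows the dataset's naming pattern, and in practice Spark reads the .csv.gz files directly:

```python
import gzip

# Round-trip example: compress a small tab-separated snippet in memory,
# then decompress it, mimicking the gunzip step on the raw files.
csv_text = b"latitude\tlongitude\tpopulation\n14.5995\t120.9842\t123.4\n"
compressed = gzip.compress(csv_text)
restored = gzip.decompress(compressed)
assert restored == csv_text

# On disk, the same idea (filename follows the dataset's naming pattern):
# with gzip.open('PHL_total_population.csv.gz', 'rt') as f:
#     header = f.readline()
```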
The processed data will be passed through different machine learning models to determine whether it is possible to predict population density using the available features.
The data will be processed as follows before being fed to the ML models:
1. String Indexer Transformation
2. Vector Assembler Transformation
3. Minmax Scaling Transformation
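The last of these transformations is the simplest to state explicitly: min-max scaling maps each feature value x to (x - min) / (max - min), bringing every feature into [0, 1]. A plain-Python sketch of the formula (Spark's MinMaxScaler applies the same mapping per feature across the assembled vectors; the latitude values below are illustrative):

```python
def min_max_scale(values):
    """Scale a list of numbers into [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

latitudes = [24.3, 35.5, 43.1, 45.5]  # illustrative values only
print(min_max_scale(latitudes))
```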
ML Models that will be used:
1. Linear Regressor Model
2. Decision Tree Model
3. Random Forest Model
The data used for the study was sourced from the AWS Open Data Registry and can be found by searching for the dataset titled "High Resolution Population Density Maps + Demographic Estimates by CIESIN and Facebook". It contains almost 27GB of files, many of which are CSV files with coordinate and population density columns. The dataset covers many countries; to limit the scope of this research, we focus only on the following:
1. Philippines
2. South Korea
3. Japan
We will be creating different Machine Learning Models for each country to try and predict the population density.
Picture of File Size
Picture of Instances with Instance Type and Workers
Picture of Dashboard showing Workers
Since the dataset can be considered big due to the total size of the files and the sheer number of rows, an EMR Spark cluster is used to perform the preprocessing. It can be seen that there are several folders for each of the three countries, corresponding to the demographic categories. To process the data, each file is appended into one consolidated Spark dataframe with dedicated columns for the demographic and the country it belongs to. In this way, a single dataframe of the whole dataset is obtained.
The data processing infrastructure is a Spark cluster created in AWS EMR, consisting of one master and three workers. EC2 instances of type m4.xlarge with 64GB of storage were used.
The commands below connect the notebook to the existing Spark cluster and preprocess the files.
def sampling(frac):
# Return dictionary of stratified sampling for each demographic
return {
'children_under_five': frac,
'elderly_60_plus': frac,
'men': frac,
'total_population': frac,
'women': frac,
'women_of_reproductive_age_15_49': frac,
'youth_15_24': frac,
}
def get_data(df, country, frac):
# Return top 100 dense locations plus a stratified sample of the data
top = spark.createDataFrame(df.filter(df.country == country).rdd.top(
100, attrgetter("population")), struct)
sample = df.filter(df.country == country).sampleBy(
'demographic', sampling(frac))
return top.unionAll(sample)
counry_frac = 0.00005
sample_jp = 0.00002
sample_kr = 0.00006
sample_ph = 0.0001
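The intent of get_data above, keeping the densest points and adding a small per-stratum Bernoulli sample (what sampleBy does), can be sketched without Spark. The rows below are illustrative stand-ins:

```python
import heapq
import random

def get_data_sketch(rows, frac, k=2, seed=0):
    """Top-k densest rows plus a per-row Bernoulli sample,
    mimicking rdd.top(...) followed by sampleBy."""
    top = heapq.nlargest(k, rows, key=lambda r: r['population'])
    rng = random.Random(seed)
    sample = [r for r in rows if rng.random() < frac]
    return top + sample

rows = [
    {'population': 9528.9, 'demographic': 'women'},
    {'population': 0.17, 'demographic': 'children_under_five'},
    {'population': 52.08, 'demographic': 'women'},
]
densest = get_data_sketch(rows, frac=0.0)  # frac=0 keeps only the 2 densest
print([r['population'] for r in densest])  # → [9528.9, 52.08]
```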
%%local
def make_map(data, title, zoom=6, height=600, color='demographic',
color_continuous_scale='balance', **kwargs):
# Return map figure of data with title
fig = px.scatter_mapbox(
data, title=title, zoom=zoom, color=color, height=height,
lat="latitude", lon="longitude", size='population',
**kwargs)
fig.update_mapboxes(style='carto-darkmatter')
fig.update_layout(title_font_size=25, title_font_color='blue')
return fig
%%spark -o df
# Read each source file from s3 and consolidate into a dataframe
prefix = "s3://dataforgood-fb-data/csv/month=2019-06/"
COUNTRIES = ['JPN', 'KOR', 'PHL']
DEMOGRAPHICS = [
'children_under_five',
'elderly_60_plus',
'men',
'total_population',
'women',
'women_of_reproductive_age_15_49',
'youth_15_24',
]
struct = StructType([
StructField('latitude', FloatType(), True),
StructField('longitude', FloatType(), True),
StructField('population', FloatType(), True),
StructField('country', StringType(), True),
StructField('demographic', StringType(), True),
])
df = spark.createDataFrame(sc.emptyRDD(), struct)
for country in COUNTRIES:
for demographic in DEMOGRAPHICS:
path = (f'{prefix}country={country}/type={demographic}/' +
f'{country}_{demographic}.csv.gz')
_df = spark.read.csv(path, sep='\t', header=True, schema=struct)
_df = _df.withColumn('country', lit(country))
_df = _df.withColumn('demographic', lit(demographic))
df = df.unionAll(_df)
The dataset contains the following columns and their descriptions.
Philippine, Japan and South Korea Population Density Dataset:
| Column Name | Description |
|---|---|
| latitude | The latitude (EPSG:4326/WGS84) coordinates of the center of the 1-arc-second-by-1-arc-second grid cell |
| longitude | The longitude (EPSG:4326/WGS84) coordinates of the center of the 1-arc-second-by-1-arc-second grid cell |
| population | The value is the (statistical) number of people in that grid of coordinates |
| country | Name of Country |
| demographic | Demography/Division of Population |
df.show(10)
+---------+---------+----------+-------+-------------------+
| latitude|longitude|population|country|        demographic|
+---------+---------+----------+-------+-------------------+
|41.774307|140.79263| 0.1722968|    JPN|children_under_five|
|41.758472|140.86958| 0.1722968|    JPN|children_under_five|
|41.779305|140.78986| 0.1722968|    JPN|children_under_five|
|41.759583|140.71764| 0.1722968|    JPN|children_under_five|
|41.741528|140.92236| 0.1722968|    JPN|children_under_five|
|41.768196|140.70375| 0.1722968|    JPN|children_under_five|
| 41.78014|140.79736| 0.1722968|    JPN|children_under_five|
|41.757637|141.07625| 0.1722968|    JPN|children_under_five|
| 41.76736|140.81541| 0.1722968|    JPN|children_under_five|
|41.776527|140.76347| 0.1722968|    JPN|children_under_five|
+---------+---------+----------+-------+-------------------+
only showing top 10 rows
Given the sheer number of rows in the data, a small sample is taken to allow us to visualize and get a glimpse of the information. We will only check a fraction of the dataset, around 0.005%, since parsing through all of it would take a long time.
%%spark -o data1
df1 = df.filter(df.demographic == "total_population")
sample_all = {'JPN': counry_frac, 'KOR': counry_frac, 'PHL': counry_frac}
data1 = df1.sampleBy('country', sample_all)
%%local
title = 'Population Density of PH, JPN, KOR'
center = {'lat': 26.3607036, 'lon': 127.7439425}
fig = make_map(data1, color="country", height=900, zoom=3.5,
color_continuous_scale=px.colors.cyclical.IceFire,
center=center, title=title)
fig.show()
This visualization shows all countries included in the dataset, with a sample of their respective population densities. Some areas are less populated than others while the most dense areas are usually near the capital city of each country.
A view of each country's population density distribution, plotted with Plotly, follows.
%%spark -o df2
df2 = df.filter(df.demographic != "total_population")
%%spark -o datajp
from operator import attrgetter
datajp = get_data(df2, 'JPN', sample_jp)
%%local
# Plot the data into a map
title = 'Population Density of Japan per Demographic'
fig = make_map(datajp, title, zoom=5)
fig.show()
In Japan, the most dense areas aside from Tokyo appear to be Osaka and Nagoya. Aside from that, there seems to be an outlier in the city of Matsue, where there is a dense concentration of women. The population distribution covers the whole country except for some mountainous regions. The densest areas concentrate around where the shinkansen passes.
We will use the datajp sample plotted above to create an ML model that predicts the population density.
Sample head of datajp
# Show the head of datajp
datajp.show(10)
+---------+---------+----------+-------+-----------+
| latitude|longitude|population|country|demographic|
+---------+---------+----------+-------+-----------+
|35.546528|133.23347| 52.085503|    JPN|      women|
|35.545696| 133.2243| 52.085503|    JPN|      women|
|35.547085|133.22874| 52.085503|    JPN|      women|
|35.546806|133.23152| 52.085503|    JPN|      women|
| 35.54514|133.22626| 52.085503|    JPN|      women|
|35.545696| 133.2268| 52.085503|    JPN|      women|
|35.547916|133.23264| 52.085503|    JPN|      women|
|35.546528| 133.2343| 52.085503|    JPN|      women|
|35.547638| 133.2393| 52.085503|    JPN|      women|
|35.547085|133.22736| 52.085503|    JPN|      women|
+---------+---------+----------+-------+-----------+
only showing top 10 rows
Schema of datajp
print(f'Total Columns: {len(datajp.dtypes)}')
datajp.printSchema()
Total Columns: 5
root
 |-- latitude: float (nullable = true)
 |-- longitude: float (nullable = true)
 |-- population: float (nullable = true)
 |-- country: string (nullable = true)
 |-- demographic: string (nullable = true)
Converting all string-type columns to category type
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import (
LinearRegression,
DecisionTreeRegressor,
RandomForestRegressor
)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import MinMaxScaler
# Convert String Value to Category Type
cat = StringIndexer(inputCol='demographic', outputCol='demo_cat')
cat1 = cat.fit(datajp)
datajp_1 = cat1.transform(datajp)
datajp_1.show(5)
+---------+---------+----------+-------+-----------+--------+
| latitude|longitude|population|country|demographic|demo_cat|
+---------+---------+----------+-------+-----------+--------+
|35.546528|133.23347| 52.085503|    JPN|      women|     0.0|
|35.545696| 133.2243| 52.085503|    JPN|      women|     0.0|
|35.547085|133.22874| 52.085503|    JPN|      women|     0.0|
|35.546806|133.23152| 52.085503|    JPN|      women|     0.0|
| 35.54514|133.22626| 52.085503|    JPN|      women|     0.0|
+---------+---------+----------+-------+-----------+--------+
only showing top 5 rows
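As the demo_cat column above shows, StringIndexer maps each distinct label to a numeric index, by default ordered by descending frequency (the most common label becomes 0.0). A plain-Python sketch of the same idea on a hypothetical column:

```python
from collections import Counter

def string_index(labels):
    """Assign indices by descending frequency, like Spark's StringIndexer."""
    order = [lab for lab, _ in Counter(labels).most_common()]
    mapping = {lab: float(i) for i, lab in enumerate(order)}
    return [mapping[lab] for lab in labels]

col = ['women', 'men', 'women', 'youth_15_24', 'women', 'men']
print(string_index(col))  # women -> 0.0, men -> 1.0, youth_15_24 -> 2.0
```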
Combining the needed features into a single feature vector.
For this data, we will use latitude, longitude, and the demographic as the main features for all our ML models.
# Determine the Features
feature_list = ['latitude', 'longitude','demo_cat']
# Using VectorAssembler to join all needed features
assembler = VectorAssembler(inputCols=feature_list, outputCol="features")
datajp_3 = assembler.transform(datajp_1)
datajp_3.show(5)
+---------+---------+----------+-------+-----------+--------+--------------------+
| latitude|longitude|population|country|demographic|demo_cat|            features|
+---------+---------+----------+-------+-----------+--------+--------------------+
|35.546528|133.23347| 52.085503|    JPN|      women|     0.0|[35.5465278625488...|
|35.545696| 133.2243| 52.085503|    JPN|      women|     0.0|[35.5456962585449...|
|35.547085|133.22874| 52.085503|    JPN|      women|     0.0|[35.5470848083496...|
|35.546806|133.23152| 52.085503|    JPN|      women|     0.0|[35.5468063354492...|
| 35.54514|133.22626| 52.085503|    JPN|      women|     0.0|[35.5451393127441...|
+---------+---------+----------+-------+-----------+--------+--------------------+
only showing top 5 rows
Scaling the Data using MinMax Scaling
# MinMax Scaler
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
#fitting The Data
scaler_model = scaler.fit(datajp_3)
#transform the Data
datajp_4 = scaler_model.transform(datajp_3)
datajp_4.show(5)
+---------+---------+----------+-------+-----------+--------+--------------------+--------------------+
| latitude|longitude|population|country|demographic|demo_cat|            features|      scaledFeatures|
+---------+---------+----------+-------+-----------+--------+--------------------+--------------------+
|35.546528|133.23347| 52.085503|    JPN|      women|     0.0|[35.5465278625488...|[0.53244013515232...|
|35.545696| 133.2243| 52.085503|    JPN|      women|     0.0|[35.5456962585449...|[0.53240065611266...|
|35.547085|133.22874| 52.085503|    JPN|      women|     0.0|[35.5470848083496...|[0.53246657524311...|
|35.546806|133.23152| 52.085503|    JPN|      women|     0.0|[35.5468063354492...|[0.53245335519771...|
| 35.54514|133.22626| 52.085503|    JPN|      women|     0.0|[35.5451393127441...|[0.53237421602188...|
+---------+---------+----------+-------+-----------+--------+--------------------+--------------------+
only showing top 5 rows
Splitting the Data in Train and Test Data
(trainingData, testData) = datajp_4.randomSplit([0.8, 0.2])
We will use the Linear Regression model of PySpark to attempt to create a prediction model.
lr = LinearRegression(labelCol="population", featuresCol="scaledFeatures")
lr_model = lr.fit(trainingData)
print('Finished Training')
trainingSummary = lr_model.summary
trainingSummary.rootMeanSquaredError
print('')
print('Linear Regression Accuracy Results:')
print('')
print("RMSE of Training: %f" % trainingSummary.rootMeanSquaredError)
print("r2 of Training: %f" % trainingSummary.r2)
testsum = lr_model.evaluate(testData)
print('')
print("RMSE of Test: %f" % testsum.rootMeanSquaredError)
print("r2 of Test: %f" % testsum.r2)
Finished Training

Linear Regression Accuracy Results:

RMSE of Training: 7.879756
r2 of Training: 0.112512

RMSE of Test: 7.931902
r2 of Test: 0.100798
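For reference, the two metrics reported above can be computed directly from predictions; a minimal pure-Python sketch with hypothetical values (PySpark's training summary and RegressionEvaluator compute the same quantities at scale):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [2.0, 4.0, 6.0]  # hypothetical densities
y_pred = [2.5, 3.5, 6.0]
print(rmse(y_true, y_pred), r2(y_true, y_pred))  # r2 here is exactly 0.9375
```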
dt = DecisionTreeRegressor(labelCol="population", featuresCol="scaledFeatures")
evaluator_rmse = RegressionEvaluator(predictionCol="prediction",
labelCol="population",
metricName='rmse')
evaluator_r2 = RegressionEvaluator(
predictionCol="prediction", labelCol="population", metricName='r2')
dtparamGrid = (ParamGridBuilder().addGrid(
dt.maxDepth, [2, 4, 6, 8, 10, 12])).build()
dtcv = CrossValidator(estimator = dt,
estimatorParamMaps = dtparamGrid,
evaluator = evaluator_r2,
numFolds = 5)
dt_model = dtcv.fit(trainingData)
print('Finished Training')
dt_model_train = dt_model.transform(trainingData)
dt_model_test = dt_model.transform(testData)
print('')
print('Decision Tree Accuracy Results:')
print('')
print('RMSE of Training:', evaluator_rmse.evaluate(dt_model_train))
print('R2 Score of Training:', evaluator_r2.evaluate(dt_model_train))
print('\n')
print('RMSE of Test:', evaluator_rmse.evaluate(dt_model_test))
print('R2 Score of Test:', evaluator_r2.evaluate(dt_model_test))
print('\n')
print('Best Model:',dt_model.bestModel)
Finished Training

Decision Tree Accuracy Results:

RMSE of Training: 0.5111298692551856
R2 Score of Training: 0.9962657847165292

RMSE of Test: 0.7611303424482431
R2 Score of Test: 0.9917201861145581

Best Model: DecisionTreeRegressionModel (uid=DecisionTreeRegressor_cc534df2b633) of depth 12 with 1577 nodes
dt = RandomForestRegressor(labelCol="population", featuresCol="scaledFeatures")
evaluator_rmse = RegressionEvaluator(predictionCol="prediction",
labelCol="population",
metricName='rmse')
evaluator_r2 = RegressionEvaluator(
predictionCol="prediction", labelCol="population", metricName='r2')
dtparamGrid = (ParamGridBuilder().addGrid(
dt.maxDepth, [2, 4, 6, 8, 10, 12])).build()
dtcv = CrossValidator(estimator = dt,
estimatorParamMaps = dtparamGrid,
evaluator = evaluator_r2,
numFolds = 5)
dt_model = dtcv.fit(trainingData)
print('Finished Training')
dt_model_train = dt_model.transform(trainingData)
dt_model_test = dt_model.transform(testData)
print('')
print('Random Forest Regressor Accuracy Results:')
print('')
print('RMSE of Training:', evaluator_rmse.evaluate(dt_model_train))
print('R2 Score of Training:', evaluator_r2.evaluate(dt_model_train))
print('\n')
print('RMSE of Test:', evaluator_rmse.evaluate(dt_model_test))
print('R2 Score of Test:', evaluator_r2.evaluate(dt_model_test))
print('\n')
print('Best Model:',dt_model.bestModel)
Finished Training

Random Forest Regressor Accuracy Results:

RMSE of Training: 1.9215652136331758
R2 Score of Training: 0.9472227099918229

RMSE of Test: 2.142031445286372
R2 Score of Test: 0.9344226008839919

Best Model: RandomForestRegressionModel (uid=RandomForestRegressor_714d8dc162a5) with 20 trees
%%spark -o datakr
from operator import attrgetter
datakr = get_data(df2, 'KOR', sample_kr)
%%local
# Plot the data into a map
title = 'Population Density of Korea per Demographic'
fig = make_map(datakr, title)
fig.show()
In South Korea, the most dense area aside from Seoul is Busan. Aside from that, there are also concentrations of population in Daegu and Gwangju. The population distribution covers the whole country except for the northeast, which is also mountainous. Interestingly, the Korail also connects the most populated areas in South Korea.
The processing done on the Japan dataset is also applied to the Korea data, as shown in the cells below:
# Show the head of datakr
datakr.show(10)
+---------+----------+----------+-------+-----------+
| latitude| longitude|population|country|demographic|
+---------+----------+----------+-------+-----------+
|37.535694|  126.8857| 21.965706|    KOR|      women|
| 37.51875| 126.87708| 21.965706|    KOR|      women|
|37.514305| 126.87375| 21.965706|    KOR|      women|
| 37.53125| 126.87986| 21.965706|    KOR|      women|
| 37.53097|126.881805| 21.965706|    KOR|      women|
|37.517918|126.873474| 21.965706|    KOR|      women|
|37.539585| 126.88153| 21.965706|    KOR|      women|
|37.541527| 126.87319| 21.965706|    KOR|      women|
|37.517082|126.876526| 21.965706|    KOR|      women|
| 37.52514| 126.87625| 21.965706|    KOR|      women|
+---------+----------+----------+-------+-----------+
only showing top 10 rows
Preprocessing
# Convert String Value to Category Type
cat = StringIndexer(inputCol='demographic', outputCol='demo_cat')
cat1 = cat.fit(datakr)
datakr_1 = cat1.transform(datakr)
# Determine the Features
feature_list = ['latitude', 'longitude','demo_cat']
# Using VectorAssembler to join all needed features
assembler = VectorAssembler(inputCols=feature_list, outputCol="features")
datakr_3 = assembler.transform(datakr_1)
# MinMax Scaler
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
#fitting The Data
scaler_model = scaler.fit(datakr_3)
#transform the Data
datakr_4 = scaler_model.transform(datakr_3)
datakr_4.show(5)
+---------+----------+----------+-------+-----------+--------+--------------------+--------------------+
| latitude| longitude|population|country|demographic|demo_cat|            features|      scaledFeatures|
+---------+----------+----------+-------+-----------+--------+--------------------+--------------------+
|37.535694|  126.8857| 21.965706|    KOR|      women|     0.0|[37.5356941223144...|[0.82532240865999...|
| 37.51875| 126.87708| 21.965706|    KOR|      women|     0.0|[37.5187492370605...|[0.82208264105971...|
|37.514305| 126.87375| 21.965706|    KOR|      women|     0.0|[37.5143051147460...|[0.82123294964721...|
| 37.53125| 126.87986| 21.965706|    KOR|      women|     0.0|[37.53125,126.879...|[0.82447271724749...|
| 37.53097|126.881805| 21.965706|    KOR|      women|     0.0|[37.5309715270996...|[0.82441947478130...|
+---------+----------+----------+-------+-----------+--------+--------------------+--------------------+
only showing top 5 rows
# Splitting the data into training and test sets
(trainingData, testData) = datakr_4.randomSplit([0.8, 0.2])
lr = LinearRegression(labelCol="population", featuresCol="scaledFeatures")
lr_model = lr.fit(trainingData)
print('Finished Training')
trainingSummary = lr_model.summary
trainingSummary.rootMeanSquaredError
print('')
print('Linear Regression Accuracy Results:')
print('')
print("RMSE of Training: %f" % trainingSummary.rootMeanSquaredError)
print("r2 of Training: %f" % trainingSummary.r2)
testsum = lr_model.evaluate(testData)
print('')
print("RMSE of Test: %f" % testsum.rootMeanSquaredError)
print("r2 of Test: %f" % testsum.r2)
Finished Training

Linear Regression Accuracy Results:

RMSE of Training: 4.143654
r2 of Training: 0.130265

RMSE of Test: 3.827081
r2 of Test: 0.150023
dt = DecisionTreeRegressor(labelCol="population", featuresCol="scaledFeatures")
evaluator_rmse = RegressionEvaluator(predictionCol="prediction",
labelCol="population",
metricName='rmse')
evaluator_r2 = RegressionEvaluator(
predictionCol="prediction", labelCol="population", metricName='r2')
dtparamGrid = (ParamGridBuilder().addGrid(
dt.maxDepth, [2, 4, 6, 8, 10, 12])).build()
dtcv = CrossValidator(estimator = dt,
estimatorParamMaps = dtparamGrid,
evaluator = evaluator_r2,
numFolds = 5)
dt_model = dtcv.fit(trainingData)
print('Finished Training')
dt_model_train = dt_model.transform(trainingData)
dt_model_test = dt_model.transform(testData)
print('')
print('Decision Tree Accuracy Results:')
print('')
print('RMSE of Training:', evaluator_rmse.evaluate(dt_model_train))
print('R2 Score of Training:', evaluator_r2.evaluate(dt_model_train))
print('')
print('RMSE of Test:', evaluator_rmse.evaluate(dt_model_test))
print('R2 Score of Test:', evaluator_r2.evaluate(dt_model_test))
print('')
print('Best Model:',dt_model.bestModel)
Finished Training

Decision Tree Accuracy Results:

RMSE of Training: 0.652805998697664
R2 Score of Training: 0.9784131894589513

RMSE of Test: 1.4796315674171507
R2 Score of Test: 0.8729487095076395

Best Model: DecisionTreeRegressionModel (uid=DecisionTreeRegressor_2d00defcea7e) of depth 12 with 1615 nodes
dt = RandomForestRegressor(labelCol="population", featuresCol="scaledFeatures")
evaluator_rmse = RegressionEvaluator(predictionCol="prediction",
labelCol="population",
metricName='rmse')
evaluator_r2 = RegressionEvaluator(predictionCol="prediction",
labelCol="population",
metricName='r2')
dtparamGrid = (ParamGridBuilder().addGrid(
dt.maxDepth, [2, 4, 6, 8, 10, 12])).build()
dtcv = CrossValidator(estimator = dt,
estimatorParamMaps = dtparamGrid,
evaluator = evaluator_r2,
numFolds = 5)
dt_model = dtcv.fit(trainingData)
print('Finished Training')
dt_model_train = dt_model.transform(trainingData)
dt_model_test = dt_model.transform(testData)
print('')
print('Random Forest Regressor Accuracy Results:')
print('')
print('RMSE of Training:', evaluator_rmse.evaluate(dt_model_train))
print('R2 Score of Training:', evaluator_r2.evaluate(dt_model_train))
print('')
print('RMSE of Test:', evaluator_rmse.evaluate(dt_model_test))
print('R2 Score of Test:', evaluator_r2.evaluate(dt_model_test))
print('')
print('Best Model:',dt_model.bestModel)
Finished Training

Random Forest Regressor Accuracy Results:

RMSE of Training: 1.6338635683869065
R2 Score of Training: 0.8647766207081543

RMSE of Test: 1.7556688817310606
R2 Score of Test: 0.8211219189426743

Best Model: RandomForestRegressionModel (uid=RandomForestRegressor_ab3e6ee6b4bc) with 20 trees
%%spark -o dataph
from operator import attrgetter
dataph = get_data(df2, 'PHL', sample_ph)
%%local
# Plot the data into a map
title = 'Population Density of Philippines per Demographic'
fig = make_map(dataph, title, zoom=4)
fig.show()
In the Philippines, the densely populated areas are limited to Metro Manila and Davao. An outline of the country can't be clearly seen. Most dense areas outside of Manila and Davao are composed of men, with the exception of Sulu which has a concentration of women. Compared to the previous two countries, we don't have a national railroad system. Also compared to the previous two countries, the disparity between the places where people congregate and the rest of the country is apparent, with the markers in other places almost invisible.
# Show the head of dataph
dataph.show(10)
+---------+----------+----------+-------+--------------------+
| latitude| longitude|population|country|         demographic|
+---------+----------+----------+-------+--------------------+
|14.742361| 121.22014|  9528.903|    PHL|               women|
|14.757083|121.240974|  9528.903|    PHL|               women|
|14.741806|121.218475|  9528.903|    PHL|               women|
|14.742361| 121.22014|  9253.292|    PHL|                 men|
|14.757083|121.240974|  9253.292|    PHL|                 men|
|14.741806|121.218475|  9253.292|    PHL|                 men|
| 15.04875|  120.7307|  8415.362|    PHL|                 men|
| 15.04875|  120.7307| 7800.2007|    PHL|               women|
|14.742361| 121.22014|  5319.835|    PHL|women_of_reproduc...|
|14.757083|121.240974|  5319.835|    PHL|women_of_reproduc...|
+---------+----------+----------+-------+--------------------+
only showing top 10 rows
Preprocessing
# Convert String Value to Category Type
cat = StringIndexer(inputCol='demographic', outputCol='demo_cat')
cat1 = cat.fit(dataph)
dataph_1 = cat1.transform(dataph)
# Determine the Features
feature_list = ['latitude', 'longitude', 'demo_cat']
# Using VectorAssembler to join all needed features
assembler = VectorAssembler(inputCols=feature_list, outputCol="features")
dataph_3 = assembler.transform(dataph_1)
# MinMax Scaler
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
#fitting The Data
scaler_model = scaler.fit(dataph_3)
#transform the Data
dataph_4 = scaler_model.transform(dataph_3)
dataph_4.show(5)
+---------+----------+----------+-------+-----------+--------+--------------------+--------------------+
| latitude| longitude|population|country|demographic|demo_cat|            features|      scaledFeatures|
+---------+----------+----------+-------+-----------+--------+--------------------+--------------------+
|14.742361| 121.22014|  9528.903|    PHL|      women|     1.0|[14.7423610687255...|[0.68624276785092...|
|14.757083|121.240974|  9528.903|    PHL|      women|     1.0|[14.7570829391479...|[0.68726293550880...|
|14.741806|121.218475|  9528.903|    PHL|      women|     1.0|[14.7418060302734...|[0.68620430587146...|
|14.742361| 121.22014|  9253.292|    PHL|        men|     0.0|[14.7423610687255...|[0.68624276785092...|
|14.757083|121.240974|  9253.292|    PHL|        men|     0.0|[14.7570829391479...|[0.68726293550880...|
+---------+----------+----------+-------+-----------+--------+--------------------+--------------------+
only showing top 5 rows
# Splitting the data into training and test sets
(trainingData, testData) = dataph_4.randomSplit([0.8, 0.2])
lr = LinearRegression(labelCol="population", featuresCol="scaledFeatures")
lr_model = lr.fit(trainingData)
print('Finished Training')
trainingSummary = lr_model.summary
trainingSummary.rootMeanSquaredError
print('')
print('Linear Regression Accuracy Results:')
print('')
print("RMSE of Training: %f" % trainingSummary.rootMeanSquaredError)
print("r2 of Training: %f" % trainingSummary.r2)
testsum = lr_model.evaluate(testData)
print('')
print("RMSE of Test: %f" % testsum.rootMeanSquaredError)
print("r2 of Test: %f" % testsum.r2)
Finished Training

Linear Regression Accuracy Results:

RMSE of Training: 628.639683
r2 of Training: 0.020206

RMSE of Test: 459.188996
r2 of Test: -0.002987
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dt = DecisionTreeRegressor(labelCol="population", featuresCol="scaledFeatures")
evaluator_rmse = RegressionEvaluator(predictionCol="prediction",
                                     labelCol="population",
                                     metricName='rmse')
evaluator_r2 = RegressionEvaluator(predictionCol="prediction",
                                   labelCol="population",
                                   metricName='r2')
# grid-search the tree depth with 5-fold cross-validation, scored by R2
dtparamGrid = ParamGridBuilder().addGrid(dt.maxDepth, [2, 4, 6, 8, 10, 12]).build()
dtcv = CrossValidator(estimator=dt,
                      estimatorParamMaps=dtparamGrid,
                      evaluator=evaluator_r2,
                      numFolds=5)
dt_model = dtcv.fit(trainingData)
print('Finished Training')
dt_model_train = dt_model.transform(trainingData)
dt_model_test = dt_model.transform(testData)
print('')
print('Decision Tree Accuracy Results:')
print('')
print('RMSE of Training:', evaluator_rmse.evaluate(dt_model_train))
print('R2 Score of Training:', evaluator_r2.evaluate(dt_model_train))
print('')
print('RMSE of Test:', evaluator_rmse.evaluate(dt_model_test))
print('R2 Score of Test:', evaluator_r2.evaluate(dt_model_test))
print('')
print('Best Model:',dt_model.bestModel)
Finished Training

Decision Tree Accuracy Results:

RMSE of Training: 329.35581483655443
R2 Score of Training: 0.7310559681548491

RMSE of Test: 366.30288881828545
R2 Score of Test: 0.36174662183590667

Best Model: DecisionTreeRegressionModel (uid=DecisionTreeRegressor_2dbeb77024f8) of depth 8 with 389 nodes
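CrossValidator with a ParamGridBuilder works by averaging the evaluator's score over k folds for each grid value and keeping the best-scoring setting. A toy pure-Python sketch of that loop (hypothetical names, not the Spark implementation):

```python
def cross_validate(data, grid, fit, score, k=5):
    """For each parameter in `grid`, average `score` over k folds and
    return the parameter (and score) that did best, as CrossValidator does."""
    folds = [data[i::k] for i in range(k)]  # simple round-robin folds
    best_param, best_score = None, float("-inf")
    for param in grid:
        scores = []
        for i in range(k):
            test = folds[i]
            train = [row for j, f in enumerate(folds) if j != i for row in f]
            model = fit(train, param)
            scores.append(score(model, test))
        avg = sum(scores) / k
        if avg > best_score:
            best_param, best_score = param, avg
    return best_param, best_score

# toy usage: the "model" is just a slope, and the data follows y = 2x,
# so cross-validation should pick slope 2 from the grid
data = [(x, 2 * x) for x in range(20)]
fit = lambda train, slope: slope
score = lambda slope, test: -sum((y - slope * x) ** 2 for x, y in test)
best, _ = cross_validate(data, [1, 2, 3], fit, score)
print(best)
```

In the Spark run above the grid is over `maxDepth`, and the search settled on a depth-8 tree with 389 nodes.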
from pyspark.ml.regression import RandomForestRegressor

# same evaluators and grid-search setup as above, now for a random forest
rf = RandomForestRegressor(labelCol="population", featuresCol="scaledFeatures")
evaluator_rmse = RegressionEvaluator(predictionCol="prediction",
                                     labelCol="population",
                                     metricName='rmse')
evaluator_r2 = RegressionEvaluator(predictionCol="prediction",
                                   labelCol="population",
                                   metricName='r2')
rfparamGrid = ParamGridBuilder().addGrid(rf.maxDepth, [2, 4, 6, 8, 10, 12]).build()
rfcv = CrossValidator(estimator=rf,
                      estimatorParamMaps=rfparamGrid,
                      evaluator=evaluator_r2,
                      numFolds=5)
rf_model = rfcv.fit(trainingData)
print('Finished Training')
rf_model_train = rf_model.transform(trainingData)
rf_model_test = rf_model.transform(testData)
print('')
print('Random Forest Regressor Accuracy Results:')
print('')
print('RMSE of Training:', evaluator_rmse.evaluate(rf_model_train))
print('R2 Score of Training:', evaluator_r2.evaluate(rf_model_train))
print('')
print('RMSE of Test:', evaluator_rmse.evaluate(rf_model_test))
print('R2 Score of Test:', evaluator_r2.evaluate(rf_model_test))
print('')
print('Best Model:', rf_model.bestModel)
Finished Training

Random Forest Regressor Accuracy Results:

RMSE of Training: 462.8618961028636
R2 Score of Training: 0.46882931242081016

RMSE of Test: 370.0104362982871
R2 Score of Test: 0.34876102796101327

Best Model: RandomForestRegressionModel (uid=RandomForestRegressor_e7de68df4fb7) with 20 trees
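With all three models evaluated, picking the winner is just an argmax over test R². A small sketch using the Philippine test metrics reported above (the dictionary layout is illustrative, not part of the pipeline):

```python
# test-set metrics for the Philippines, copied from the outputs above
results = {
    "LinearRegression": {"test_rmse": 459.188996, "test_r2": -0.002987},
    "DecisionTree":     {"test_rmse": 366.302889, "test_r2": 0.361747},
    "RandomForest":     {"test_rmse": 370.010436, "test_r2": 0.348761},
}

# select the model with the highest test R2
best = max(results, key=lambda name: results[name]["test_r2"])
print(best)
```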
We were able to plot the population density of each country and visualize it clearly. The first graph shows that the total population of the Philippines greatly outweighs that of South Korea and Japan. It also shows that each country has certain areas where population density is concentrated and clustered together. These areas are important for each country to consider, since they generate development and economic activity; the same graph also shows where population density is lowest. Using this visualization, governments can target their projects and aid toward those important areas. From this view alone, we can already see that the population density of the Philippines is much greater than that of South Korea and Japan.
Machine Learning Models:
JAPAN
The machine learning models created for predicting Japan's population density are Linear Regression, Decision Tree Regressor, and Random Forest Regressor. Based on the results, the Decision Tree Regressor achieved the highest accuracy among the models. Note that for this research, linear regression serves as the baseline model for all countries. The accuracy of the linear regression is very low, which suggests that latitude and longitude are not linearly correlated with Japan's population density data; the same holds for the demographic feature. This is reflected in a training accuracy of about 11% and a test accuracy of about 10%, and the root mean square error of the linear regression is quite large relative to the values being predicted. For the Japan dataset, the Decision Tree Regressor achieved the highest accuracy, at almost 99% for both the training and test sets. The Random Forest Regressor scored slightly lower, at about 93%.
SOUTH KOREA
The machine learning models created for predicting South Korea's population density are Linear Regression, Decision Tree Regressor, and Random Forest Regressor. Based on the results, the Decision Tree Regressor achieved the highest accuracy among the models. As with Japan, the accuracy of the linear regression is very low, which suggests that latitude and longitude are not linearly correlated with South Korea's population density data; the same holds for the demographic feature. This is reflected in a training accuracy of about 13% and a test accuracy of about 15%, and the root mean square error of the linear regression is quite large relative to the values being predicted. For the South Korea dataset, the Decision Tree Regressor achieved the highest accuracy, at almost 97% on the training set and 87% on the test set. The Random Forest Regressor scored slightly lower.
PHILIPPINES
The machine learning models created for predicting the Philippines' population density are Linear Regression, Decision Tree Regressor, and Random Forest Regressor. Based on the results, the Decision Tree Regressor again achieved the highest accuracy among the models. The accuracy of the linear regression is extremely low, even lower than for Japan and South Korea, which suggests that latitude and longitude are not linearly correlated with the Philippines' population density data; the same holds for the demographic feature. This is reflected in training and test accuracies of about 2%, and the root mean square error of the linear regression is quite large relative to the values being predicted. For the Philippine dataset, the Decision Tree Regressor reached only about 36% accuracy on the test set, which suggests that the Philippine population density data is not as well suited to these models as that of the other countries. The Random Forest Regressor scored slightly lower than the Decision Tree Regressor.
A dataset from the AWS Open Data Registry entitled High Density Population Maps was parsed and underwent data analysis, data preprocessing, and further filtering to answer the problem statement: can we predict population density using the latitude, longitude, and demographic features of the dataset? Several machine learning models were created to determine whether the problem is answerable, and to compare their results and reach a sound conclusion.
The results show the following conclusions:
- For Japan, South Korea, and the Philippines, it is possible to predict the population density using the latitude and longitude parameters together with the demographic parameter.
- Among the models used, the Decision Tree Regressor predicted population density with the highest accuracy for all three countries.
[1] Yegorov, Yuri. (2015). Economic Role of Population Density. https://www.researchgate.net/publication/283637652_Economic_Role_of_Population_Density Accessed 29 Nov 2020.
[2] Facebook Connectivity Lab and Center for International Earth Science Information Network – CIESIN – Columbia University. 2016. High Resolution Settlement Layer (HRSL). Source imagery for HRSL © 2016 DigitalGlobe. https://dataforgood.fb.com/docs/high-resolution-population-density-maps-demographic-estimates-documentation/ Accessed 29 Nov 2020.